Revisiting Waiting Times in DNA Evolution* 



Pierre Nicodeme 

LIPN - Team CALIN, CNRS-UMR 7030, University Paris North, 
Institut Galilee , 99, Avenue Jean-Baptiste Clement, 93430 - Villetaneuse, France 
phone: +33 (0)14940-4069, fax: +33 (0)14826-0712, pierre.nicodeme@lipn.univ-parisl3.fr 

March 5, 2013 



Abstract 

Transcription factors are short stretches of DNA (or fc-mers) mainly 
located in promoters sequences that enhance or repress gene expres- 
sion. With respect to an initial distribution of letters on the DNA al- 
phabet, Behrens and Vingron [3] consider a random sequence of length 
n that does not contain a given fc-mer or word of size k. Under an evo- 
lution model of the DNA, they compute the probability p n that this 
fc-mer appears after a unit time of 20 years. They prove that the wait- 
ing time for the first apparition of the fc-mer is well approximated by 
T n = 1/pjj. Their work relies on the simplifying assumption that the k- 
mer is not self-overlapping. They observe in particular that the waiting 
time is mostly driven by the initial distribution of letters. Behrens et 
al. [2] use an approach by automata that relaxes the assumption related 
to words overlaps. Their numerical evaluations confirms the validity 
of Behrens and Vingron approach for non self-overlapping words, but 
provides up to 44% corrections for highly self-overlapping words such 
as AAAAA. We devised an approach of the problem by clump analysis 
and generating functions; this approach leads to prove a quasi-linear 
behaviour of p n for a large range of values of n, an important result for 
DNA evolution. We present here this clump analysis, first by language 
decomposition, and next by an automaton construction; finally, we 
describe an equivalent approach by construction of Markov automata. 



1 Introduction 



Several theoretical studies have tried to give a probabilistic explanation 
for the speed of changes in transcriptional gene regulation (e.g. [2], [3]). 
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Behrens and Vingron |3j infer how long one has to wait until a given Tran- 
scription Factor (TF for short) binding site emerges at random in a promoter 
sequence. Using a Bernoulli probabilistic model denoted by MO and esti- 
mating evolutionary substitution rates based on multiple species promoter 
alignments for the three species Homo sapiens, Pan troglodytes and Macaca 
mulatta, they compute the expected waiting time for every k-mer, k ranging 
from 5 to 10, until it appears in a human promoter. They conclude that the 
waiting time for a TF binding site is highly determined by its composition 
and that indeed TF binding sites can appear rapidly, i.e. in a time span 
below the speciation time of human and chimp. 

However, in their approach, Behrens and Vingron [3] rely on the as- 
sumption that if a k-mer of interest appears more than once in a promoter 
sequence, it does not overlap with itself. This particularly affects the wait- 
ing times for highly autocorrelated words like e.g. AAAAA or CTCTCTCTCT. 
Using automata, Behrens et al. [2] relaxed this assumption and, thus, more 
accurately compute the expected waiting times until appearance of /c-mers 
in a promoter of length 1000. 

While Behrens and Vingron [3] and all preceding works were mostly in- 
terested in sequences of fixed length n = 1000, Behrens et al. [2] realized 
that p n behaves asymptotically linearly with n for a wide range of lengths. 
This observation followed from a singularity analysis performed by the au- 
thor of the present article; this property is biologically important, since the 
lengths of promoters are in an approximate range from 1000 base pairs to 
2000 base pairs; moreover, it cannot be deduced easily from the rigorous 
computations by automata of Behrens et al. [2J. We give here proofs of 
this property that are based on clump analysis and use either combinatorics 
and language decompositions or automata constructions. Our adaptation of 
previous methods is new and has theoretical and practical interests. 

We present the model in Section [2} We recall in Section [3] the Behrens- 
Vingron equations (2010) and the automaton approach of Behrens et al. 
(2012). The main part of the article is devoted in Section [4] to counting the 
number H n of putative-hit positions in random sequences of length n; at first 
order, the probability p n is then a linear function of H n . We provide in this 
section the background for the Guibas-Odlyzko language decomposition and 
its extension to clump analysis, and a parallel construction by automata. 
This section also contains the translation to generating functions of the 
formal languages used and states our result of quasi-linearity of p n ; the 
proof of this result is given in Section [A] of the Appendix. Section [5] sketches 
a proof by automaton that does not rely on clump constructions. 



2 



A) Estimations for u(a), a £ A: 



v{A) v{C) v(G) u(T) 
0.23889 0.26242 0.25865 0.24004 



B) Estimations for p a ^(l), a, (3 £ A: 





A 


C 


G 


T 


A 


9.99999996e-01 


4.54999995e-09 


1.57499996e-08 


3.40000002e-09 


C 


6.14999993e-09 


9.99999996e-01 


7.14999985e-09 


2.17499994e-08 


G 


2.17499994e-08 


7.14999985e-09 


9.99999996e-01 


6.14999993e-09 


T 


3.40000002e-09 


1.57499996e-08 


4.54999995e-09 


9.99999998e-01 



Table 1: Parameter estimations. Numbers taken from [3]. 



2 Preliminaries 

Throughout the article, we assume that promoter sequences evolve according 
to model M0 which has been described by [3]. 

Model MO. Given an alphabet A = {A,C,G,T}, let 5(0) = (5i(0),..., 
5 n (0)) denote the initial promoter sequence of length n taking values in 
this alphabet. We assume that the letters in 5(0) are independent and 
identically distributed with u(x) := Pr(5i(0) = x). Let the time evolution 
(5(i))t>o of the promoter sequence be governed by the 4x4 infinitesimal rate 
matrix Q = {r a ,p)a,^A- The matrix F(t) = (p a ,p(t)) a ,i3eA containing the 
transitions probabilities of a evolving into f3 in finite time t > 0, (a, /3 £ AY, 
can be computed by P(i) = e'^ 1 ; see Karlin- Taylor [8], p. 150-152. Table 111 
provides the parameters used. 

The expected waiting time. Given a binding site 

b=(bi,...,b k ) where bi, ... ,b k £ A, (1) 

the aim is to determine the expected waiting time until b emerges in a 
promoter sequence of length n provided that it does not appear in the initial 
promoter sequence 5(0). More precisely, let 

T n = inf{t £ N : 3t £ {1, . . . , n - k + 1} 

such that (5j(i), . . . , S i+k -i(t)) = (&i, . . . , b k )}. 
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Then, given that Pr(6 occurs in 5(0)) = 0, the waiting time T n has approx- 
imately a geometric distribution with parameter p n verifying 

p n = Pr(6 occurs in generation 1 | b does not occur in generation 0) (2) 
= Pr(6e 5(1) | 6 5(0)). 

See [3j. In particular, one has E(T n ) ~ — . 

Pn 

3 Previous work 

We present briefly Behrens and Vingron [3] and Behrens et al. [2] methods. 
3.1 Behrens-Vingron (2010) 

Considering the fc-mer b = b± . . . bf. , Behrens and Vingron consider (i) the 
probability that it occurs at time t = 1 in 5(1) and (ii) the probability that 
it evolves from a fc-mer of 5(0). Case (i) is computed by inclusion-exclusion 
and by assuming that the word b is not self-overlapping. This gives Q 

Pr(\S(l)\ b > 1) = E(-l) m " { \ " Pr(|5(l)| 6 = i). 

Taking in account the evolution probability, they consider next the words at 
substitution distance 1 to k of b. Assuming that the insertion of such words 
within a sequence 5(0) with no occurrence of b does not create an occurrence 
of b (this is wrong in general, but a good approximation for non-overlapping 
words long enough in a 4 letters alphabet), they obtain 

Pr ( |5(1)| 6 > 1 | |5(0)| 6 = ) « ^ J (-l) m ( n ~ {k ~ 1)l )pi, (3) 

( k V 

with pe=\ Yl v(ai)---v(a>k)J\pa i ,b i {l)l ■ (4) 

\(a 1 ...a k )eA k \{b 1 ...b k } i=l / 

In the last equation, pi is the approximate probability that b occurs t times 
in 5(1) while not occurring in 5(0). 



1 We use the notation IwL to note the number of occurrences of a word u in a word w. 
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3.2 Automaton approach of Behrens et al. (2012) 



Behrens et al. [2] use the following algorithm. Let = (Q:={0, 1, ...,&}, 5&, 
0, {A;}) be the Knuth- Morris-Pratt automaton over the alphabet A that rec- 
ognizes the language A*bA*. The language A*\A*bA* is recognized by the 
automaton A5 = (Q, 5b, 0, {0, 1, . . . , k — 1}). They construct a product au- 
tomaton P = A;, (g) A;, on A x A such that 



P = (Q x Q,(5,(0,0),F), with 



F = {0,l,...,fc-l}x {A:}. 



They weight (i) any transition g — > q' of A^ by i/(a) and (ii) any transition 

c — — \ d of P by via) x p a ^ a ', where v is the initial distribution of letters 
and p x ^y is the probability of evolution of letter x to letter y in a unit time. 
Considering the corresponding adjacency matrices and P, (provided a 
suitable reordering of the lines and columns of the matrices), Vf being a 
column vector with a one for each final state in P and zero elsewhere, the 
probability p n verifies, 

p n = (1,0,..., 0) xf x V F I (1,0, ... ,0) xAj x (l7T.^T,0)*. 

Table [2] provides the top 10 5-mers with respect with the correction done 
by Behrens et al. (2012) with respect to Behrens- Vingron (2010). 
Considering the minimal period m(b) of a /c-mer b, such that 

m(b) = min(i, \u\ = i; b = u l .v, v prefix of u), 

and noting i-periodic a word with minimal period i, half of the 5-mers, 
two-thirds of the 7-mers and all of the 10-mers with ^b nn (J 100 °) > 1,95 are 
either 1- or 2-periodic, i.e. show a high degree of autocorrelation. This 
implies that, for only 4% of the 5-mers, 0.2% of the 7-mers and 0.002% 
of the 10-mers, the exact computations of Behrens et al. (2012) differ by 
more than 5% of the approximate computations of Behrens- Vingron (2010). 
However, as shown in Behrens et al. (2012), a non negligible number of 
Transcription Factors are highly correlated. 



4 Clump approach 

Table [2] shows clearly the importance of autocorrelation. Assuming a four 
letters alphabet with a uniform probability distribution, founding an occur- 
rence of AAAAA at a position, up to boundary effects, we have a probability 
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BNN 




BV 








T7i (rp \ /-] f\Q 

■•^BNN^IOOOJ/ J-U 








EbnnCTlooo) 
Ebv(^iooo) 


recce 


9 1 05 


1 091 




i 

± 


1 44 


GGGGG 


9.570 


1022 


6.666 


142 


1.44 




10.401 


1023 


7.457 


993 


1.39 


AAAAA 


10.656 


1024 


7.654 


1024 


1.39 


CGCGC 


7.047 


699 


6.446 


11 


1.09 


TCCCC 


7.076 


737 


6.477 


17 


1.09 


CCCCT 


7.076 


738 


6.477 


21 


1.09 


GCGCG 


7.127 


787 


6.518 


31 


1.09 


CTCTC 


7.263 


883 


6.679 


148 


1.09 


CACAC 


7.337 


945 


6.750 


217 


1.09 



Table 2: Expected waiting times (generations) for 5-mers in model 
MO with ^bnnCTiooo) > L09> ( T 1Q results from Table 2 f Behrens et 

al. [2]). Ebv(^iooo) denotes the expected waiting time according to Behrens- 
Vingron [3] (BV) and Ebnn(^iooo) according to the automaton approach of 
Behrens et al. [2] (BNN). Ranks refer to 5-mers sorted by their waiting time 
of appearance according to the two different procedures BV and BNN; rank 
1 is assigned to the fastest evolving 5-mer, rank 1024 (=4 5 ) to the slowest 
emerging 5-mer. 



1/4 of finding an occurrence shifted by one position. In contrast, consider- 
ing an occurrence of AACCC, we need reading at least 5 new letters to find a 
new occurrence, and the probability of finding two consecutive occurrences 
is 1/4 5 . This is a well known fact in combinatorics of words; words occur by 
clumps and, while clumps of a non-overlapping word have only one occur- 
rence of the word, clumps of an overlapping word may have several; since 
the probability (in a uniform model) of occurrence of any word of a given 
size at any position is the same, the proportion of text covered by clumps 
of a non-overlapping word will be larger than this of an self-overlapping 
word. This property extends to sets of words depending of their self-overlap 
structure. 

We show here that the number of positions in 5(0) that can mutate and 
provide an occurrence of a fe-mer b in 5(1), or putative- hit positions, is not 
a function of the number of occurrences of b in 5(1) or of the neighbours 
of b in 5(0), but that this number can be computed by a variant of clump 
analysis. 
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CCCAACAACAACCCCCCCCAACACCACA 



CCCCAAACAAACAAACAAAACACAAC 
CCC . .AAC .AAC .AAC AAC 



CCC 



ACA 



ACA 
ACA 
AAC 



I II III IV 



V 



CAA 
AAC 
ACA 
CAA 
AAC 
CAA 
AAC 



I 



CAA 
AAC 
ACA 



II 



.ACA 



III 



Figure 1: Clumps and putative-hit positions. Sequences <S&(0) for b = 
ACC (left) and S b >{0) for b' = AAA (right). The sequence S b (0) (resp. 5ft/ (0)) 
avoids the fc-mer b (resp. b'). Putative-hit positions are underlined and in 
red. Clumps are shown at their respective position under the sequences. 
Note that extensions to the right of clumps of the set d(AAA) for b' = AAA, 
while creating a new occurrence of a word of the set, do not add necessarily a 
new putative-hit position; clump I (right) contains 7 occurrences of d(kkk), 
but only 4 putative-hit positions for b' = AAA. Therefore the number of word 
occurrences is not the relevant statistics for precisely counting putative-hit 
positions. Note also in the clump I for b = ACC (left) that, when the right 
extension of a clump adds a new putative-hit position, this position is not 
necessarily in the extension, but possibly backwards left. 



Notations. Given a word b, we note d(b) the set of its neighbours at edit 
distance 1 (by substitution of one letter), and de(b) the vector resulting of 
a lexicographic sort of d(b). Therefore, for an alphabet A = {A, C}, we have 
d(ACC) = {CCC, AAC, ACA} and d e (kCC) = (AAC, ACA, CCC). 
The correlation set C Vl>V2 of two words v\ and V2 is defined as usual, 

C Vl<V2 = { e | there exists e' € A + such that vie = e'v2 with |e| < }. 

When we have w = v\ = V2, we get C w<w = C (the autocorrelation of w). 

Putative-hit positions. Given a sequence S(0) not containing a k-mer 
b, a putative-hit position is any position of S(0) that can lead by a mutation 
to an occurrence of b in 5(1), where we assume that a single mutation has 
occurred. We have for instance 

5(0) = CCCAACAC, b = ACC 5(0) = CCCAACAC, 

where the putative- hit positions are underlined in 5(0) . Mutating any single 
putative-hit position of 5(0) leads to a sequence 5(1) with an occurrence of 
b = ACC. 
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Examples of sequences S(0) for the 3-mers ACC and AAA (see Figure [T]) 
reveal that the right method to carry on the computation of putative-hit 
positions is clump analysis pQ. 

Aim of the computation. In the following, H n is the random variable 
counting the number of putative- hit positions in a random sequence of length 
n. We consider the generating function Fb(z, t) that counts the number of 
putative-hit positions for the /c-mer b in texts avoiding this k-mer, 



where At is the set of sequences of any length that do not contain the A:-mer 
b and put-hit-pos(u;) is the number of putative-hit positions of the word 
w. Note that, up to probability of second order small magnitude, only one 
putative-hit position will mutate. 

4.1 Analysis "a la Guibas-Odlyzko" 

Considering a reduced set of words (no word is factor of another word in 
the set), Regnier and Szpankowski [T^l US] arid Regnier [TT] use (as an 
evolution of Guibas and Odlyzko previous work [6j [7j) a natural parsing 
or decomposition of texts with respect to the occurrences of the set. 

We follow here the corresponding presentation of Lothaire |9j (Chapter 
7). Let V = {v%, . . . , v r } be a reduced set of words. We have, formally 

Definition 4.1 Right, Minimal, Ultimate and Not languages. 

- The "Right" language TZi associated to the word m is the set of words IZi = 
{r\r = e- Vi and there is no v G V such that r = xvy with \y\ > 0}. 

- The "Minimal" language M.ij leading from a word Vi to a word Vj is the set 
of words Mij = {m \ i>j • m = e • v j and there is no v € V such that Vi ■ m = 
xvy with \x\ > 0, \y\ > 0}. 

- The "Ultimate" language of words following the last occurrence of the word 
Vi (such that this occurrence is the last occurrence ofV in the text) is the set 
of words lAi = {u\ there is no v £ V such that m ■ u — xvy with \x\ > 0}. 

- The "Not" language M is the set of words with no occurrences of V, J\f = 
{n | there is no v G V such that n = xvy}. 

It is possible to obtain the generating functions of these languages by com- 
binatorics and by new automata constructions. 
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4.2 Regnier-Szpankowski equations 

Considering the matrix M = (Mij) and using C{j — Cvi,Vj as a shorthand, 
we have 



(J [M k j = A*- Vj + C„ - (%e, = [jMj +W» - e, (6) 

fc>i iJ i 

A-Kj - (Kj - Vj) = (J vMij, M-Vj = TZj + (J Tfcj (Cy - <%e) ,(7) 

where the Kronecker symbol <5y is 1 if i = j and elsewhere. These equa- 
tions are non-ambiguous and translate to generating functions, where for a 
language £ and its generating function L(z), we have L{z) = Pr(w)z^ . 

Translating the system of Equations ([6j[7]) to generating functions and solv- 
ing the resulting system provide the generating functions Ri(z), Mij(z), 
Uj(z) and N{z) of the Right, Minimal, Ultimate and Not languages. The 
parsing by languages is now reflected in the following equation 



' N(z) + (R 1 (z),...,R r (z))(I-M(z))- 1 



1 - z 



( Ui(z) 

\ U r (z) 



(8) 



where — - — is the generating function of A*, the set of all texts. 
1 — z 

4.3 New automata constructions 

The languages 1Zi,Mij,Uj,J\f are recognized by the following automata 
(where (^) is the usual automaton product): 

Ui = A*.Vi (g) Not{A*v s A) ViMij = ViA* (g) A* .vj (^) Not(A + v s A' ] 

se{l,...,r} se{l,...,r} 



VjUj = vjA* Nat{A + v 8 A*) M = Not I (g) A*v s A* 

se{l,...,r} \se{l,...,r} 

4.4 Constrained languages 

Language approach. Considering the word b = AAA, and ^(AAA) = 
(AAC, ACA, CAA), we can compute from the vector of words (AAC, ACA, CAA, AAA) 
a row vector of Right languages (72-i , 72.2, 7^-3, 7^-4 )> a matrix of Minimal lan- 
guages {Mij) with i and j from 1 to 4 and a column vector of Ultimate 
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languages (Ui, UiiU"&M±V ■ Extracting the languages with indices from 1 to 
3 provides us for g^(AAA) with the Right TZi = TZi, Minimal = Aiij and 
Ultimate Uj = Uj languages avoiding the word AAA. 

The construction given here is fully general. For any finite alphabet A 
and any word b, it is (at least theoretically) possible to solve the Regnier- 
Szpankowski equations for the extended sequence (&i, ... , b r , b) where di(b) = 
(b\, . . . , b r ), which provides the constrained languages for de(b). 

Automata approach. It is also immediate to construct by automata the 
constrained languages. For instance, we have, for i,j G {1,2,3}, 

ViMij = ViA* (g) A*.Vj (g) Not(^ + u s ^l + ) (g) Not(A*bA*), 

se{l,2,3} 

and the general case follows also easily. 



4.5 Clump equations by language decomposition 

Bassino et al. Q] modify the Regnier-Szpankowski analysis of reduced sets 
to more specifically consider clumps of occurrences, where a clump is consti- 
tuted either (i) of a single isolated (with no overlap with other occurrences) 
occurrence of a word of the pattern, or (m) of a maximal set of occurrences 
where each occurrence overlaps at least another one. 

We consider the residual language V = C.w~ as V = {v, v ■ w £ £}. 
Considering two languages C\ and £2, we write £2 — £1 = £2 \ £1 = {v, v G 
C 2 ,v £1}. 

The clumps can be generated by a matrix of codes IfC = (/Qj ) . With 

r- k> t2 a+ -1 \ B ij =C ij nMi j 'di^j, 

>C, j = B„-B„A+ and j B „ = (Cij _ £ , n ») 

the language decomposition by clumps for a pattern V = {v\, . . . , v r } is 



A* =U+(TZ 1 v^,...,TZ r v-)G^{M-K)-Gy : , with 



= (/Cy), § = K*, 
= (v$ij) 

(10) 



Example 4.2 For the word w = AAAA, we have C = {e, A, A, AAA} and K. = 
{A}. For the pattern V = {TATAT, CATAT}, we have C C atat,tatat = {AT, 
ATAT} and /C C atat,tatat = {AT}. For the pattern V = {CAA, AAT, AAA}, we 
have Ccaa.aat = {T, AT} and /C C aa,aat = {T}. 
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Constrained clumps. The finite code languages generating the correla- 
tion languages of two words are easy to compute directly; one must however 
also avoid the forbidden word b while extending clumps. We therefore define 
for Vi (resp. Vj) the i-th (resp. j'-th) entry of the sequence d^(b) 

JQj = {he Kij] \vi.h\b = 0}, 

where \g\b is again the number of occurrences of the word b in the word g. 
Since the sets /Qj are finite, the computations of the codes /Qj can be done 
by string-matching. 



Gathering everything, we obtain a constrained version of Equation (10) 
for the language At of texts avoiding the word b, 

_ " _ .*pA f 

Al=M+(TZ 1 v^,...,TZ r v^)G((U-K)-G) : , witlW § = W, 

\uj ( G = (v&ij) 

(11) 

4.6 Computing the generating function of the number of 
putative-hit positions 

We prove that the computation of the generating function F(,(z,t) of Equa- 
tion [5] follows from Equation ( |11[ ). Indeed, taking in consideration the 
lengths of the words and the number of occurrences of putative-hit posi- 
tions, we have first V{(z,t) = Pr(vi)tz^ for each Vi G d(b). Next, for each 
fCij, we can compute by string matching the number of putative-hit positions 
in each word of Uj./Cy. This gives 



IC i:j (z, t)= ^ Pr(u;)t put - hit - pos ^- u '^ 1 z |u ' 

WdKij 



where we substracted the putative-hit position occurring within v% 
From the last equation and Equation (llll), we get 



K(z,t) = t)) , S(z,t) = (I-K(z,t)) \ 

G(z,t) = (viiz,t)§ijiz,t)y (12) 



Substituting in Equation ( 11 ) G by G(z, t) and JV, TZiV^ , (M— K) and Ui with 
1 < i < r by their translations to generating functions (that depend only of 
the variable z) provides the expression of Fb(z,t) that has been formally 
defined in Equation^. 
We also have 
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F b (z, 1) = At(z, 1) = = 5>r(S n (0) A*bA*) x 



(13) 

n>0 n>0 

where f^' is the probability that a random sequence of length n does not 
contain the word b. This implies that the conditionnecj^] expectation E(.ff n ) 
of the number of putative-hit positions verifies 



E(H n ) = E(H n )/f® = [z n ] 



dF b (z,t) 



at 
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(6) 



(14) 



t=l 



Considering again the evolution matrix P(l) = (p a _>.p) with a, (3 £ {A, C, G, T}, 
we state the following proposition that we prove in the Appendix, Section |A} 



Proposition 4.3 For (i) max a ^ g _4 ;Q ,^^ {p a -+{$) <S 1 and (ii) n large enough 
with n <C mm a /3eyl o^/3 (Pa->/3)j ^ e probability that a k-mer occurs at time 1 
while not occuring at time in a sequence of length n behaves quasi-linearly 
with respect to the length n. The convergence to this quasi-linear regime is 
exponential. 



4.7 Approach by automata of clumps 

We can alternatively use the construction of clumps by automata given in 
Bassino et al. [I]. 

For a set V = {v±, . . . ,v r } with correlation sets we construct a kind 
of "Aho-Corasick" automaton on the following set of words X 

X = {vi • w | 1 < i < r and w G {e} U Cij for some j}. 

The considered automaton T is built on X with set of states Q = Pref(X) 
and start or initial state s = e. The transition function is defined (as in the 
Aho-Corasick construction) by 

5(p,x) = the longest suffix of px £ Pref(AT). 

We build with this construction, for any £;-mer b, the automaton recognizing 
the clumps of the neighbours d(b) of b while avoiding occurrences of b; this 
last condition can be made effective by doing an automaton product. Assum- 
ing that the set of states of the resulting automaton T is Q = {0, 1, . . . , s} 

2 We use the classical equation Pr(A|_B) = Pr(A)/Pr(_B) for two events A and B such 
that B C A. 
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Figure 2: Automaton for constrained clumps of d(AAA) = {AAC, ACA, CAA}. 
Double circles signals an occurrence of one of these words. Transitions cov- 
ered by tildes (A, C) emits a signal counting a putative-hit position. The 
missing transitions A have been erased since we want to avoid occurrences 
of b = AAA. The missing transitions C point to the state x- All states are 
terminal. 

and that the initial state is labelled 0, we set all the states of the automaton 
T to terminal to recognize all sequences avoiding b. Therefore, we have 

T = ({0,l,...,s},M,{0,l,..., S }. 

See Figure[2]for an example with the alphabet A = {A, C}, the k-mer b = AAA 
and d(b) = {AAC, ACA, CAA}. Transitions with a "tilde" correspond to finding 
a new putative-hit position in the last recognized occurrence of a word of 
d(b). 

Clump-Core. We consider the set of states O that recognize an occur- 
rence of d(b), 

O = {q, 5(0, w) =q, we X}. 
We also consider the set of states E that do not belong to a clump extension, 

E = {q, 5{0, w)=q, we Pref (d(b))}, 

where Pref(d(6)) is the set of strict prefixes of words of d{b). 
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We define finally the Clump-Core of the automaton by its set of states 
E which verifies 

E = Q\E. 

Referring to the automaton of Figure [2J we have E = {0, 1, 2, 16 (x), 6, 12} 
and E = {3, 4, 5, 7, 8, 9, 10, 11, 13, 14, 15}. 

Markov property. By construction of the automaton, for any word w 
with I iz7 1 < |6|, we have the following property, 

J flw' 7^ w with (\w'\ = \w\) 
I such that 5(qi,w) = 5(q2,w') = e. 



Ve G E, \/w with (|u;| < 



This property can be proved iteratively with respect to the length of the 
words. 

Handling the putative-hit positions. For simplicity, we assume that 
there is only one type of mutation, but the method extends to the general 
case. We count as previously the putative-hit positions by the variable t. 

For each state o G O (recognizing an occurrence of d(b)), let 9(o) be the 
word w with \w\ < \b\, of maximal length, and verifying, 

1. there exists q such that 5(q, w) = o, 

2. there is no u G Pref(w) such that S(q,u) G O. 

By the Markovian property, this defines a unique word. Referring to Fig- 
ure^ we have (9(7) = ACA, 9(5) = AA, 0(14) = C, and 9(15) = A. Moreover, 
the Markovian property asserts that reading backward \b\ transitions from 
any state o G O does a reverse spelling of a unique word of d(b). We can 
next locate the putative-hit position within this word and check if it belongs 
to 9(o). 

The adjacency matrix M(t) = (hij(t)) associated to the automaton T 
is then defined as follows: hij(t) = if there is no transition from i to j; 
elsewhere, assume that 6(i, a) = j. We have then 

3 ?0, 

j G O and 9(j) contains no putative-hit position, 
Pr(a) x t elsewhere. 

The generating function Fb(z,t) defined in Equation §5§ verifies 

F b (z, t) = (1, 0, . . . , 0) x (I + zM(t) + • • • + z n U n (t) + . . . ) x 1* 
= (1,0,..., 0) X (I-zM^))- 1 x 1*. 



hij(t) 



Pr(a) if 
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5 Yet another proof by automata 



We sketch a proof that does not make use of clumps. The construction is 
computationally very costly. 

We build the (pruned) Knuth-Morris-Pratt automaton K recognizing 
A*bA* (the set of sequences avoiding the k-mer b). 

Next we compute the order-(2|6| — 1) Markov automaton M of K. The 
transitions of this automaton are words of size 2\b\. It is possible by reading 
the transitions to know when a new putative-hit position is present, and to 
multiply the corresponding entry in the associated adjacency matrix by the 
counting variable t. Let M(i) be this matrix. The matrix associated to the 
automaton K is positive, irreducible and transitive; so is the matrix M(i), 
disregarding a trie- like structure leading to its recurrent part. Writing p mu t 
the probability of mutation, we can make the substitution t (1 — p mn t) + 
x x p mu t- We then have for the recurrent part M.(t) of M(i), 

U(x) := M((l - p mut ) + x x p mut ) = Y + xp mut X. 

Assuming that n x p mu t = o(l), we get for a polynomial P(x) 

U n (x) = Y n + xn^wY^X + x 2 P(x) x O ((np mut ) 2 ) . (15) 

Writing and £ y the dominant eigenvalues of U(l) and Y, the property 
npmut = o(l) entails that = £™ x (1 + o(l)). We then deduce from 



Equation (15) that 



x 1 1,0,..., U n (x 1* , , M 

Pn ~ / , ; ; ' = (a + Bxn) x (l + o(l)). 

F (1,0, ...,0)U n 1 1* y H 1 y y " 



6 Conclusion 

We provided several methods for analysing waiting times in DNA evolution 
that give insights in the structure of the problem. We showed that clump 
analysis and generating functions are powerful and convenient tools for this 
aim and used either language analysis methods or automata constructions. 
In particular we proved the property of quasidinearity related to the prob- 
ability of first occurrence of a fc-mer after a unit time. 



Acknowledgements. We thank Sarah Behrens and Cyril Nicaud for help- 
ful discussions and technical help. 
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A Singularity analysis 



The methods developed in Section |4.6| apply to any /c-mer with any finite 
alphabet. Moreover, using the additivity of expectations, we could split 
the putative- hit positions along their types; with the toy alphabet {A, C}, 
we would get two putative- hit positions type, (A — > C) and (C — > A). By 
following the same footsteps as in Section |4.6[ we can now compute the 
expectations of putative-hit positions E (f#"* C) ) and E (#^ A) ) which 
correspond to the two types of mutation. Considering only the case (A — > C), 
we can by pattern-matching compute Kij [z, i(A->c)) • We have as previously 
Vi (z,t( k -+ C )) = Pr(vi)z\ v *h( k ^ c y 

We write in the following for sake of simplicity x = £(a^c) i an d consider 
the generating function Fb{z,x) = Fb(z, i(A->c)) where the function Fb{z,t) 
is defined in Equation §5§. 

The solutions of the Regnier-Szpankowski equations provide functions 
that are rational |^} Similarly, each term of the matrix Equation (11) is 



rational and so are the corresponding extensions to counts of putative-hit 
positions that lead to the explicit value of Fb(z,x). 

We can therefore write for two polynomials P(z, x) and Q(z, x) 



F b (z,x) = -p^ and F 6 (*, 1) = £/S - 

Q(z,x) ^ Q(z,l 



where, again, is the probability that a random sequence of length n has 
no occurrence of b. We have 

E(z) = ^E(H n jz =g^y- Q 2 {zA) • 



n>0 



This series has only positive coefficients and by Pringsheim Theorem [3] [Th. 
IV. 6, p. 240], it has a real positive singularity on the circle of convergence that 
we note r; by considering the automaton recognizing A*bA*, the associated 
irreducible and primitive matrix, and Perron- Frobenius properties of positive 
matrices [8], this real positive singularity is dominant. The singularity r is 
also the smallest positive solution of Q(z, 1) = 0. 

Therefore, extracting the nth Taylor coefficient of the generating func- 
tions E(z) and Fb(z, 1) by Cauchy integrals along a circle of radius r < R < 



3 This property follows also from an equivalent approach by finite automata and use of 
the Chomski-Schiitzenberger algorithm [10] that leads to solve a linear system of equations. 
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T2, where r 2 is the value of the second largest singularity (ies) in modulus, 
we obtain for constants t/j, 4>\ and 4>2 

/f^xrt^ll + O^)), (B<1), 

and E(H^) = [z n ]E(z)=T- n ((f> 1 xn + (t>2)x(l + 0(B n )). (16) 
It follows then immediately that 

E = E(H^) //„ = (d x n + 02) x (1 + O (B n )) , (5 < 1). 

(17) 

In the more general case, we have, for n <C min Q ^ g ^ {p a -+p)i 

p n « ^ E(#(^))xp a _ >/3 (l) = (C 1 xn+C 2 )x(l + 0(K")), (K<1), 

where C\ and C2 are constants, and K is the maximum of the |*4|(|.4.| — 1) 
constants B used when applying the Equation (17) to the |^4| (\A\ — 1) types 
of mutation. 



This proves Proposition 4.3 




Figure 3: Asymptotic linear behaviour of the unconditionned r\ n (left) and condi- 
tionned rj n = f] n /f n (right) expectations of the number of putative- hit positions 
for b = ACAC (plain red lines) and b' = AACC (blue dots) with the alphabet {A, C} 



and Pr(A) = Pr(C) = 1/2. See Equations (13), @ and (17 1. 
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B Toy example for the clump approach by lan- 
guage 

We consider the following toy example 

A = {A, C}, b = AAA, b' = ACC, Pr(A) = Pr(C) = -. 



We want to estimate the expectations of the total number of putative-hit 
positions (A— >C) and (C— >A) for the words b and b' . 

Equation ^ becomes, with A™ the subset of sequences of size n of A^, 

z" 
2< 



(18) 



As mentioned earlier, putative-hit positions only occur in the clumps, and 
therefore the core of differences between the behaviour of the 3-mers b = AAA 
and b' = ACC come from differences in the matrices of codes "K^ and "Ky . 
We have 



b = ACAC 

d t (b) = (AAAC, ACAA, ACCC, CCAC) 
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(i f (6)(z,t) = <i f (6')(2,t) = ( 



zH zH zH zH 
16 ' 16 ' 16 ' 16 



The last equations [^intuitively suggests that there should be more putative- 
hit positions in a random sequence for b = ACAC than for b' = AACC. This 
is verified in Figure [3] (left) where we plot the unconditionned expectations 
of the number of putative-hit positions for both 4-mers. However, when 
conditionning as in Figure [3] (right), the 4-mer ACAC get lower expectations 
than the 4-mer AACC; this follows from the values of the constants C\ for 
b and b' that respectively are C\ = 0.2452503893 for b = ACAC and Ci = 
0.3068491678 for b' = AACC. 

Figure [3] moreover exhibits the linear behaviour of these expectations 



with respect to the length n of the sequences, as stated in Proposition 4.3 



4 The extension AC £ A3acaa,aaac as in ACAAjAC leads to no new putative-hit position. The 
same remark applies to the extension AC G /Caccc.ccac- 
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